Voiced Speech Synthesis Using Pitch Asynchronous Code Excited Linear Filters for the Glottal Source

Authors

  • Arun Kumar
  • Sandeep Kumar
  • Indra Narayan Kar
Abstract

This paper proposes a model for natural quality voiced speech synthesis that uses a code excited linear all-pole filter to model the glottal source signal. Classical glottal signal models are explicit-time functions, which inhibit joint source-tract parameter estimation and require pitch synchronous estimation with precise segmentation of the open and closed phases of the glottis. The proposed implicit-time glottal model overcomes these problems. A switched, code excited linear all-pole filter for the glottal signal is found to give near natural quality voiced speech synthesis. Both objective and subjective performance results will be presented.

INTRODUCTION

This paper presents objective and subjective performance analysis results for code excited linear all-pole filters in modeling the glottal source signal for natural quality voiced speech synthesis. The basis of speech synthesis is the universal speech production model, comprising a time-varying linear all-pole filter excited by a source signal [1]. It is generally considered that unvoiced speech of “equivalent” perceptual quality to voiced speech can be synthesized under the same bandwidth constraint. Research interest therefore focuses on the problem of natural quality voiced speech synthesis. Efficient modeling of voiced speech is important in the context of natural quality speech synthesis, low bit rate speech coding, and several speech analysis problems [2], [3]. Classically, the glottal signal models used in these problems are explicit-time functions [4].
They have the following generic limitations: (a) pitch synchronous parameter estimation is needed, which in turn requires precise segmentation of the open and closed phases of the glottis; (b) the model parameters depend nonlinearly on the signal, which increases estimation complexity; and (c) it is difficult to combine an explicit-time function voiced source model with a standard implicit-time function synthesis filter for more accurate joint estimation of source and tract parameters, which may account for source-tract interactions in voiced speech production. We attempt to overcome these problems of classical glottal models with the design of pitch asynchronous code excited linear all-pole filters that are implicit-time functions.

PRIOR ART

Several explicit-time glottal models have been proposed for diverse speech processing problems. In the context of voiced speech synthesis, glottal models are either non-interactive or interactive, according to the absence or presence of interaction between the glottal source and the vocal tract in the estimation of model parameters [4]. Examples of non-interactive glottal models are Rosenberg’s trigonometric and polynomial models and the Liljencrants and Fant model, while Ishizaka and Flanagan’s “mechanical” model is an often used interactive glottal model. Schoentgen [5] proposed the homogeneous switched affine glottal model for voiced speech synthesis. The explicit-time function glottal model due to Liljencrants and Fant [6], given by

$$g_o[n] = A_1 K_1^{n} \cos(\omega n) + C_1 \tag{1}$$

$$g_c[n] = A_2 K_2^{n} + C_2 \tag{2}$$

is a solution of the homogeneous switched affine glottal model, which is an implicit-time function model. Here, (1) and (2) give the glottal waveform models for the open and closed phases of the glottal cycle respectively, and $A_1, A_2, C_1, C_2, K_1, K_2, \omega$ are the model parameters. Schoentgen’s model consists of two sub-models:

$$g[n] = a_0 + a_1\, g[n-1] + a_2\, g[n-2], \quad g[n-d] < r \tag{3}$$

$$g[n] = b_0 + b_1\, g[n-1], \quad g[n-d] \ge r \tag{4}$$

where g[n] is the glottal signal, and $a_0, a_1, a_2, b_0, b_1, d, r$ are model coefficients which are estimated on a frame-by-frame basis. Within the frame, switching takes place between the two sub-models according to the given criterion. This model has the advantages of an implicit-time glottal model, but it still does not produce natural quality voiced speech. Kumar and Gersho [7] generalized this model to an input driven switched affine model that significantly improves the voiced speech synthesis quality. In this paper, we further improve upon the performance of this model with the design of code excited linear all-pole filters for the glottal source signal.

VOICED SPEECH SYNTHESIS METHOD USING CODE EXCITED LINEAR GLOTTAL FILTER

Figure 1 shows the block diagram of the proposed voiced speech synthesis method. Linear prediction analysis-synthesis is used for synthesizing voiced speech. For the purpose of studying the glottal source models, only frames corresponding to voiced speech are processed for source signal modeling, while unvoiced frames are concatenated at the synthesized output to facilitate perceptual analysis. The voiced speech signal is analyzed pitch asynchronously on a 20 ms frame basis to produce the vocal tract and glottal model parameters. The glottal signal g[n] is obtained by inverse filtering the speech signal s[n], followed by integration. A first order integrator with a = 0.98 is used to obtain g[n]. The integrated LP residual g[n] is modeled by a code excited linear all-pole filter. The proposed filter types can be classified by update scheme: (i) block adaptive and (ii) threshold switched. In the first update scheme, the autocorrelation method is used to estimate the filter parameters once per frame. The threshold switched linear all-pole model is given by equations (5) and (6), each paired with a switching condition involving median{·}; the equations are truncated in this excerpt.
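The glottal signal extraction described in the method section (linear prediction inverse filtering of s[n] followed by first order integration with a = 0.98) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the LP order of 10, the absence of windowing, and the integrator form g[n] = e[n] + a·g[n−1] are all assumptions made for illustration; only the 20 ms frame basis, the autocorrelation method, and a = 0.98 come from the text.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame (no window applied)."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Autocorrelation-method LP analysis via the Levinson-Durbin recursion.

    Returns a[1..order] such that s[n] is predicted by sum_k a[k] * s[n-k].
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate frame (e.g. all zeros): keep remaining taps at 0
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def inverse_filter(frame, coeffs):
    """LP residual e[n] = s[n] - sum_k a[k] * s[n-k], with zero initial state."""
    residual = []
    for n, s in enumerate(frame):
        pred = sum(c * frame[n - k]
                   for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        residual.append(s - pred)
    return residual


def leaky_integrate(x, a=0.98):
    """Assumed first order integrator form: g[n] = x[n] + a * g[n-1]."""
    g, prev = [], 0.0
    for v in x:
        prev = v + a * prev
        g.append(prev)
    return g


def glottal_signal(frame, order=10, a=0.98):
    """Integrated LP residual of one frame (order=10 is an assumed value)."""
    coeffs = levinson_durbin(autocorrelation(frame, order), order)
    return leaky_integrate(inverse_filter(frame, coeffs), a)
```

At an assumed sampling rate of 8 kHz, a 20 ms frame would be 160 samples, so `glottal_signal(frame)` would be called once per 160-sample voiced frame. A practical analysis would usually window the frame and carry filter state across frame boundaries; both are omitted here for brevity.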


Similar resources

Analysis by synthesis speech coding with generalized pitch prediction

A new analysis-by-synthesis speech coding structure is presented for high-quality speech coding in the 4 to 8 kb/s range. CELP with generalized pitch prediction (GPP-CELP) differs from classical code-excited linear prediction (CELP) in that for voiced segments it is the speech signal that is decomposed into a component predictable with the aid of the adaptive codebook (ACB) and a nonpredictable ...


Speech enhancement using voice source models

Autoregressive (AR) models have been shown to be effective models of the speech signal. However, although it is the most common model of speech, an AR process excited by white noise, when used for speech enhancement, fails to capture the effects of source excitation, especially the quasi-periodic nature of voiced speech. Speech synthesis researchers have long recognized this problem and have developed a variet...


A Review of Glottal Waveform Analysis

Glottal inverse filtering is of potential use in a wide range of speech processing applications. As the process of voice production is, to a first order approximation, a source-filter process, then obtaining source and filter components provides for a flexible representation of the speech signal for use in processing applications. In certain applications the desire for accurate inverse filterin...


Automatic pitch marking and reconstruction of glottal closure instants from noisy and deformed electro-glotto-graph signals

Pitch tracking and pitch marking (PM) are two important speech signal analysis techniques for several applications. For example, the accuracy of both pitch marking and tracking is important for generating smooth synthesized speech by controlling the pitch and duration of voiced speech in a Text-to-Speech (TTS) system. In this paper, we present a novel hybrid approach, combining electro-glotto-graph...


Glottal source and vocal-tract separation: Estimation of glottal parameters, voice transformation and synthesis using a glottal model

This study addresses the problem of inverting a voice production model to retrieve, for a given recording, a representation of the sound source which is generated at the glottis level, the glottal source, and a representation of the resonances and anti-resonances of the vocal-tract. This separation gives the possibility to manipulate independently the elements composing the voice. There are man...




Publication date: 2002